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Abstract. We present a method of computing with matrices over very small 
finite fields of size larger than 2. Specifically, we show how the Method of 
Four Russians can be efficiently adapted to these larger fields, and introduce a 
row-wise matrix compression scheme that both reduces memory requirements 
and allows one to vectorize element operations. We also present timings which 
confirm the efficiency of these methods and exceed the speed of the fastest 
implementations the authors are aware of. 



1. Introduction 

Using the special properties of the finite field F 2 and the binary-based nature 
of modern computing, a wealth of specialized algorithms and optimized implemen- 
tations have been developed for doing linear algebra over F 2 . On the other hand, 
much work has gone into creating fast linear algebra over word sized primes as a 
basic building block of multi-modualar and p-adic methods. In this paper first we 
present a method of computing with matrices over finite fields that arc significantly 
smaller than a single machine word, but larger than F 2 . Such matrices arise for 
example in number theory [8] and graph theory [14j [20] . 

We show how the Method of Four Russians can be efficiently adapted to finite 
fields larger than F 2 , and introduce a row- wise matrix compression scheme that 
both reduces memory requirements and allows one to vectorize element operations. 
As row addition is the essential operation in the method of the four Russians, these 
two techniques go very well together. We demonstrate the practicality of these 
methods with timings. 

In section [2] we present the Method of Four Russians for multiplication of ma- 
trices, and show how it can be used for the fields in question. In section[3]we show 
how the idea of bitslicing yields a convenient packed representation, and compare 
it to the representation used for very small prime fields in [9] . In section [4] we give 
the specific representations used with justification, and timings are given in section 
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2. Method of Four Russians 

The Method of Four Russians (M4RM) was first introduced by Arlazarov, Dinic, 
Kronrod, and Faradzev in the context of Graph theory [TJ 0] and has traditionally 
been used for boolean matrices. It has a runtime complexity of 0(n 3 / log n), and 
has been extended to system solving and matrix inversion in additino to matrix 
multiplication [5 . Though this has worse asymptotic complexity than Strassen- 
Winograd and other lower-exponent matrix multiplication algorithms, the actual 
cutoff is often high enough to make it competitive for medium-sized problems [5] . 
It can also effectively be used as a base-case for asymptotically faster algorithms 

Consider the product of two matrices C = AB where A is an m x I matrix and 
B is an ( x n matrix. Let C{ denote the i-th row of C. Then the rows of C can be 
viewed as linear combinations of the rows in B, with coefficients selected according 
to the i-th row of A. That is, 

i-i 

Cj = AiB = ^ a ijBj- 

3=0 

Let A; be a small integer and, for simplicity, assume for the moment that k divides 
I. We can write this sum as 

;-i i/fe-i fc-i 

Cj = a ijBj = ^ a i(sk+t) Bsk+t ■ 

When the field of definition K is small, precomputations can be used to speed up 
the sums ^+_q atB sk +t for all (ao,...,at) G K k , which can be shared for all rows 
Ci . When K = ¥2 this is done by creating lookup tables of binary combinations of 
the rows. Obviously, precomputing all possible linear combinations of the rows has 
diminishing returns when the cardinality of the field is any larger than two, so we 
adapt the method as follows. 

Let ao, a r be an additive basis for K, with maps tf>o, 4> r : K — > {0,1} such 
that a — Y^d=o a d ( t > d(a,) for all a e K. Then we can write 

t— 1 t-i / r \ r t-i 

^ a t B sk+t = I a d (f> d {a t ) J B sk+t = ^ a d ^ <pd{a t )B sk+t 

t=0 t=0 \d=0 J d=0 t=a 

where the inner sum is now over the binary combinations of rows of B, which is 
much more amenable to precomputation. One also has some freedom in choosing 
the additive basis such that the scalar products are easy to compute, for example 
letting 1 and -1 be in the basis, or choosing a power basis so Horner's rule can be 
applied. 

3. BlTSLICING AND MATRIX COMPRESSION 

The the most basic unit of arithmetic on a modern processor is a multi-bit word, 
and operating on individual bits usually cannot be done more cheaply (in fact, 
sometimes it's more expensive). However, if one is doing the same operations on 
many values, a standard trick is to pack multiple values into a single word and then 
do word-sized arithmetic on the packed values. 
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For example, one might want add vectors over Z/4Z by packing every element 
into two bits, 

(3,0, 1,2,0, 1,3,2) = 11 00 01 10 00 01 11 10. 

One would like to add two vectors of this form with standard integer addition, but 
of course, this does not work — instead, one must pack every element into three bits 

(3,0,1,2,0,1,3,2) = 011 000 001 010 000 001 011 010 

so we can handle the overflow. Addition is now normal integer addition and remov- 
ing the cary bit (reduction mod 4). 

c <— a + b 

d <- c A 011 011 ■ • • 011 

A priori, this looks pretty good: 21 element additions in 2 operations on a 64- 
bit processor. But every third bit is wasted, and if we wanted to perform scalar 
multiplication, even more padding would be needed. If only we could define our own 
arithmetic which ignores the carry, we'd be in business. This is done by bitslicing. 

In cryptography, for example, bitslicing has been used to speed up the compu- 
tation of 5-boxes in the DES cypher [6l [15j [13] . Rather than storing the bits of a 
single element as adjacent bits in a single word, we store them as parallel bits in 
multiple words. So, we represent a vector of elements over Z/4Z as a pair of words, 

(0 1 2 1 2 2) - 00010011 

{U,V,1,4,V,1,Z,Z) - 0{)100100 • 

Addition mod 4 of a pair of two-bit numbers oiao and bibo can be done with four 
bit operations 

r <- a © b 

n <- (aie6 1 )®(a A6 ). 

If we view the inputs as machine words rather than individual bits, and perform 
bitwise operations on words, the addition formula holds at each bit of the inputs. 
One a 64-bit machine we can now add 64 elements with 4 operations, or 16 elements 
per instruction; whereas above, we only add 10.5 per instruction. Any operation 
that can be expressed in terms of boolean formulas can be vectorized in this way. 

The classical packing method is used in [9] where multiple matrix entries into a 
single double-precicion floating point values and using optimized numerical linear 
algebra routines in the spirit of FFLAS/FFPACK [TT]. Simultaneous Modular 
Reduction [TU] is used to perform the modular reductions. This has the advantage 
that one can leverage the highly-optimized and tuned floating point packages such 
as ATLAS, as well as getting multi-core or hardware acceleration for free if the 
underlying BLAS is compiled to take advantage of it. Unfortunately the matrix 
dimension and amount of padding needed for a dot product puts a rather severe 
upper bound on the number of entries that can be packed in a double. Specifically, 
at least log 2 n(p— l) 2 bits need to be used per field element to compute a dot product 
of length n. This means, assuming a 53-bit mantissa, only 3 entries could be stored 
per double when multiplying a 1000 x 1000 matrix over F5 or F7. This packing 
scheme also has the disadvantage that left operand, right operand, or product are 
stored using different compression schemes. 
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4. Arithmetic over Specific Fields 

To actually implement the algorithm for a specific field, one needs to find short 
boolean formulas which express the arithmetic in the field. More accurately, we are 
interested sequential program on n inputs, that is, a sequence of tuples, 

(*o, {«o, jo}), {hji}), • ■ • , (*e, {kJe}) 

where each *k is any of (A,V, ©), and — n < u- < jk < k. Then, we can evaluate 
such a program by the recurrence 

v k <- v lk * fc v jk , 

where V- n , . . . are the inputs to the program. Trying to find small programs 
by hand is a fun exercise, but it is difficult to prove such programs minimal. 
The number of the sequential programs on n inputs with length t is given by 

fc=0 ^ ' 

For two-bit fields, a binary arithmetic operation has 4 inputs. From the naive 
count above, there are more than 128 million such programs of length 5, and over 
13 billion of length 6. This is still within the reach of an exhaustive computer search 
which we have performed. For larger fields, the number of inputs is six or more, 
and the expected minimal program length is larger as well, so brute force searching 
methods seem to be prohibitively expensive. 

We note that this is equivalent to boolean logic and circuit minimization, and 
a considerable amount of reserach has been gone into optimizing such things. Un- 
fortunately, the standard methods such as Karnaugh maps [12] and the Espresso 
algorithm 16J performed very poorly on these particular circuits and often pro- 
duced far from optimal results. For example, in our representation of F3, Logic 
Friday [T7] (which implements the Espresso algorithm) produces circuits with 12 
or more gates — twice as large what is actually required. 

4.1. Arithmetic over F3. The application of the Method of Four Russians to 
non-binary fields began, naturally, with an investigation into its feasibility over 
F 3 . We use the additive basis {1,-1}. One may be tempted to use the binary 
representation 

= 00, 1 = 01,-1 = 10, 

but we find that the minimal addition requires 7 operations per word in this rep- 
resentation. Instead, we use the representation 

= 00, 1 = 10, -1 = 11, 

so the first bit xq marks units, and the second bit x\ indicates the sign of the 
element. In this representation, vector addition requres only 6 bitwise operations 
per pair of words, 

s <— x Q © yx © xi 
t «- xi © y © V\ 
r «- (xq © y\) A (x\ © yo) 

n «- s v t. 
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Next, we note that negation requires one operation, 

r ,n <- a a ,a ®ai 

so we can automatically perform vector subtraction in 7 operations. However, we 
can do one better: 

t <- x yo 

n <- (tes/i)A(i/o®n). 

We compare this to the classical packing method, in which each element is packed 
into three bits of a word. Here, a row sum is computed by 

x <- (a + 6) A Oil Oil ••• Oil 
y <- (a + b) A 100 100 • • • 100 
1 

r <— a; + -j/ 

where the division in the last step is performed via a bit shift. For a 64-bit word, 
we perform 21 element additions in 5 operations, compared to 64 additions in 6 
operations. In this case bitslicing more than doubles the speed of computation over 
classical integer packing. 

4.2. Arithmetic over F5 and F7. For F3, we were able to find representatives 
for the field elements giving nice arithmetic formulas. Such representations for 
larger fields, if they exist at all, are quite elusive. However, the standard binary 
representation works fairly well if one relaxes the requirement that representations 
be unique. The additive basis we choose in this case is {1,2,4} which corresponds 
nicely with our representation. Now to use the multiplication algorithm specified 
above, one only needs to specify how to add and double elements using only bit 
operations. For completeness, it is useful to be able to negate and reduce to a 
canonical representative. 

Denote the standard grade-school addition on two binary integers by add. For 
an n and m-bit input, and without loss of generality assuming m < n, this can be 
done with m — 1 full adders and n — m + 1 half adders using a total of 5(m — 1) + 
2(n — m + 1) = 3m + 2n — 3 bit operations. 

For F 7 the "carry" bit from standard 3-bit addition is equal to the unit bit mod 
7. This gives particularly nice formulas. 

• Addition (17 bit operations): 

S3S2S1SQ <— add(a 2 aia , &2&1&0) 
Wiro <- add(s 2 sis ,s 3 ) 

It is easy to see there will not be a carry in the last add as not all of so, s 3 
can be 1. 

• Double (0 bit operations): 

ro,n,r 2 <- a 2 ,a a ,ai 
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• Negate (3 bit operations): 

• Reduce (5 bit operations): 

r <— a © (a V a x V a 2 ) 

ri <— ai © (a V a\ V a 2 ) 

r 2 <— a 2 © (oo V ai V a 2 ) 

Clearly the formulas for F7 generalize to a general n-bit Mersenne prime. 

Things aren't as nice for F5, but one can still find acceptable formulas. We 
introduce an auxiliary function f old5 which takes a 4-bit input s 3 s 2 SiSo and "folds" 
the highest bit into the other three, preserving the value mod 5. 

• fold5: 

The best comprehensible formula we were able to come up with has 13 
operations. 

n ,ni,n 2 <— S2,s 3 ,s^ 

e3e 2 eie <- add(n 2 nin , SiS ) 

ro,n,r 2 <- e A e 3 ,ei A e 3 ,e 2 

Using a computer-assisted search, we found a shorter (8 operation) but 
entirely cryptic formula: 

t <— s 2 V si 
r<i <— s © < 
ri <- (r 2 A s ) © (s 3 © si) 
r <- ( i © s 2 ) V (n A s 3 ) 

• Addition (20 bit operations): 

S3S 2 sis add(a 2 aia , 6 2 fei6 ) 

ro,n,r 2 <- fold5(s 3 s 2 sis ) 

• Double (5 bit operations): 

^0,^1, ^2 ^— f old5(a 2 aia 0) 

• Negate (6 bit operations): 

TQ,r\,r2 *— f old5(aia 0a 2 ) 

• Reduce (6 bit operations): 

t <— a © ai 

u <— (ao A ai) V t 

ro,rx,r 2 u, (u V a 2 ) © a , u © t 

These approaches do not seem to adapt themselves well to primes of a more 
general form. 

Though the rings Z/2 fc Z typically aren't very interesting rings to do linear alge- 
bra over, it is worth noting that these rings lend themselves to very short formulas 
of the form above by simply ignoring the last carry bit. For example, in Z/8Z one 
gets an 11 operation add, a operation double, and a 7 operation negate, with the 
benefit that the representation is unique. 
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4.3. Non-prime fields. Similar methods can be applied to extension fields, and it 
is trivial to come up with particularly nice formulas for F 2 2, F 2 3, etc. For a general 
F p n , elements can be added by n repeated applications of the addition formula for 
F p , and the additive basis can be chosen to be all products of the power basis of 
Fpn with the additive basis if F p which is n times as large. This allows one to 
perform multiplication in n 2 + o(n 2 ) times the number of bit operations needed to 
multiply over the base field. However, even more substantial gains can be achieved 
by considering bitslicing at the level of matrices rather than rows. Consider the 
matrices 

A = A + aAx and B = B a + aB x 

Where A n ,Ai,B , B\ are over F 2 and a is a generator of F 2 2. Using Karatsuba and 
reducing modulo x 2 + x + 1 , one can compute their product as 

AB = (A Q B Q + AiBi) + a ((A + i4i)(B + #i) + A Q B ) • 

This requires only three matrix multiplies over F 2 , a significant advantage. In 
general one can view a matrix A over F p ™ as 

A = A + aAi + h a n ~ x A n _ x 

where each A^ is a matrix over ¥ p . One can then use fast polynomial multiplication 
techniques to reduce the number of matrix multiplications for a product of two ma- 
trices of this form, and reduction by the defining polynomial only involves addition 
and possibly multiplication by a scalar. 

Unfortunately Toom-Cook multiplication requires more distinct elements than 
may be available in the base field, but (potentially repeated use of) Karatsuba 
works in any field, and trinomials a and b can be multiplied in any field with 6 
coefficient multiplies using the Karatsuba-likc formula 

Co = ao^o 

ci = a Q bi + aib = (a + ai)(b + h) - a b - aih 

c 2 = a a b 2 + ai&i + a 2 b a = (a + a 2 )(6 + b 2 ) - a b a - a 2 b 2 + aib\ 

c 3 = a 2 bi + a x b 2 = (a 2 + a^fo + bi) - a 2 b 2 - a^i 

c 4 = a 2 b 2 . 

which is significantly better than the 9 multiplies using elementary polynomial 
multiplication, so provides an advantage for cubic extension fields. Over fields 
with 5 or more elements, Toom-3 multiplies trinomials with 5 coefficient multiplies. 
When the matrix dimensions are much larger than the degree of the extension 
and the base field has enough elements, the large number of additions in higher- 
degree Toom-Cook algorithms can still be offset by saving a single matrix multiply. 
There is likely to be an additional constant speedup as the elements manipulated 
in the innermost loops of the linear algebra routines are algebraically simpler, and 
the smaller footprint of the matrix entries results in better memory locality across 
the matrix. Further, this enables one to leverage optimized base field code for 
extension fields instead of writing extensive amounts of code from scratch, or overly 
generalizing the code used to compute linear algebra over small prime fields. This 
may also be useful for doing arithmetic with matrices over number fields. 
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100 
B. M. 


500 
B. M. 


1000 
B. M. 


2500 
B. M. 


F 3 


0.032 0.047 


1.68 2.91 


12.2 21.4 


199 266 


F 5 


0.110 0.143 


6.47 8.62 


49.4 62.7 


742 848 


F 7 


0.105 0.181 


6.04 10.9 


45.7 79.3 


672 1070 


F 2 2 


0.091 0.037 


1.89 2.13 


9.5 15.1 


132 203 


F 2 3 


0.187 0.101 


3.85 5.50 


20.6 40.1 


261 499 


F 3 2 


0.097 0.842 


5.22 62.0 


37.6 444.0 


601 6700 



Table 1: Time to multiply n-dimensional square matrices over F g in milliseconds. 
Author Bitslicing implementations vs. Magma V2.15-3 on a 2.6GHz Opteron ma- 
chine. 



5. Implementation and Timings 

We have implemented matrix multiplication methods over F3, F5 and F7, as well 
as quadratic and cubic extensions of these fields, using the ideas presented above. 
We also implemented F 2 2 and F 2 3 using the M4RI library for arithmetic over F 2 . 
In each case, our implementations are nearly always faster than Magma [7 whose 
finite field linear algebra are the fastest known to the authors (see table [lj . 

We also compare our implementation with the FFLAS routine f gemm for F p and 
using the Givaro Zech log representation for F 2 t, both part of LinBox [19]. It 
should be noted that these are both much better suited to larger fields. Though 
we don't have an optimized implementation of the packing scheme proposed in |9J, 
an effective upper bound can be placed by calculating the number of field elements 
that can be packed into a double and performing an the appropriately-sized floating 
point matrix multiply. A comparison for two specific fields can be seen in figure 
[l] Though asymptotically faster algorithms are used, we normalize against the 
classical 0(n 3 ) to give an effective number of finite field operations per second 
(FFops). The jigsaw effect is due to the Method of Four Russians being sensitive 
to how close the matrix dimensions lie to a word boundary, and the sudden drop 
in efficiency for packed double is the transition from 4 elements per double to 3. 

Our implementation will be included in the open source math software Sage |18j . 
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